Customer segmentation on Online Retail Data Set

by Nelson Cárdenas

Customer segmentation (or market segmentation) is a set of techniques for splitting customers into clusters based on similarities, in order to understand their behavior. In this notebook, we analyze patterns in the Online Retail Data Set from the UCI Machine Learning Repository. As UCI describes the dataset:

  • "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

Columns description

  • InvoiceNo: A unique identifier for the invoice. Rows that share an invoice number belong to the same invoice (one purchase of multiple items).
  • StockCode: Identifier for items contained in an invoice.
  • Description: Textual description of each of the stock items.
  • Quantity: The quantity of the item purchased.
  • InvoiceDate: Date of purchase.
  • UnitPrice: Value of each item.
  • CustomerID: Identifier for customer making the purchase.
  • Country: Country of customer.

This notebook builds on work by several authors; see the "References" section.

In [1]:
import pandas as pd
import plotly.express as px
from datetime import timedelta
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import math
sns.set_theme()
import warnings
warnings.filterwarnings("ignore")

We load the dataset and describe its data types.

In [2]:
pd.__version__
Out[2]:
'1.0.5'
In [3]:
df_ori = pd.read_excel(io='Online Retail.xlsx')
In [4]:
df_copy = df_ori.copy()
df_ori[:3]
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
In [5]:
def describe(df, pred=None):
    """this function describes a dataframe basic information"""
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: [x.unique()])
    nulls = df.apply(lambda x: x.isnull().sum())
    distincts = df.apply(lambda x: x.unique().shape[0])
    missing_ratio = (df.isnull().sum()/ obs) * 100
    skewness = df.skew()
    kurtosis = df.kurt()
    print('Data shape:', df.shape)
    
    if pred is None:
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis']
        output = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis], axis = 1, sort=True)
    else:
        corr = df.corr()[pred]
        output = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis, corr], axis = 1, sort=True)
        corr_col = 'corr '  + pred
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis', corr_col ]
    
    output.columns = cols
    print('___________________________\nData types:\n\n',output.types.value_counts())
    print('___________________________')
    return output

details = describe(df_ori)
display(details.sort_values(by='missing ratio', ascending=False))
Data shape: (541909, 8)
___________________________
Data types:

 object            4
float64           2
int64             1
datetime64[ns]    1
Name: types, dtype: int64
___________________________
types counts distincts nulls missing ratio uniques skewness kurtosis
CustomerID float64 406829 4373 135080 24.926694 [[17850.0, 13047.0, 12583.0, 13748.0, 15100.0,... 0.029835 -1.179982
Description object 540455 4224 1454 0.268311 [[WHITE HANGING HEART T-LIGHT HOLDER, WHITE ME... NaN NaN
Country object 541909 38 0 0.000000 [[United Kingdom, France, Australia, Netherlan... NaN NaN
InvoiceDate datetime64[ns] 541909 23260 0 0.000000 [[2010-12-01 08:26:00, 2010-12-01 08:28:00, 20... NaN NaN
InvoiceNo object 541909 25900 0 0.000000 [[536365, 536366, 536367, 536368, 536369, 5363... NaN NaN
Quantity int64 541909 722 0 0.000000 [[6, 8, 2, 32, 3, 4, 24, 12, 48, 18, 20, 36, 8... -0.264076 119769.160031
StockCode object 541909 4070 0 0.000000 [[85123A, 71053, 84406B, 84029G, 84029E, 22752... NaN NaN
UnitPrice float64 541909 1630 0 0.000000 [[2.55, 3.39, 2.75, 7.65, 4.25, 1.85, 1.69, 2.... 186.506972 59005.719097
In [6]:
df_ori.describe()
Out[6]:
Quantity UnitPrice CustomerID
count 541909.000000 541909.000000 406829.000000
mean 9.552250 4.611114 15287.690570
std 218.081158 96.759853 1713.600303
min -80995.000000 -11062.060000 12346.000000
25% 1.000000 1.250000 13953.000000
50% 3.000000 2.080000 15152.000000
75% 10.000000 4.130000 16791.000000
max 80995.000000 38970.000000 18287.000000

Quantity and UnitPrice contain negative values, which may be caused by product returns. We will remove these rows, but first let's take a look at them.

In [7]:
df_ori = df_copy.copy()
bool_error = (df_ori['Quantity']<=0) | (df_ori['UnitPrice']<=0)
print('With Quantity negative and UnitPrice negative:', df_ori[(df_ori['Quantity']<0) & (df_ori['UnitPrice']<0)].shape[0])
print('With Quantity or UnitPrice negative or equals to zero:', df_ori[bool_error].shape[0])
print('Values errors by percentage:', str(df_ori[bool_error].shape[0]/df_ori.shape[0]*100)[:5], '%')
print('CustomerIDs with these anomalous values are:', df_ori[bool_error]['CustomerID'].unique())
With Quantity negative and UnitPrice negative: 0
With Quantity or UnitPrice negative or equals to zero: 11805
Values errors by percentage: 2.178 %
CustomerIDs with these anomalous values are: [14527. 15311. 17548. ... 12985. 15951. 16446.]

Rows without customer information, or with zero or negative quantity or price, will be removed.

In [8]:
df_ori = df_ori[~(bool_error | df_ori['CustomerID'].isnull())]
In [9]:
details = describe(df_ori)
display(details.sort_values(by='distincts', ascending=False))
Data shape: (397884, 8)
___________________________
Data types:

 object            4
float64           2
int64             1
datetime64[ns]    1
Name: types, dtype: int64
___________________________
types counts distincts nulls missing ratio uniques skewness kurtosis
InvoiceNo object 397884 18532 0 0.0 [[536365, 536366, 536367, 536368, 536369, 5363... -0.178524 -1.200748
InvoiceDate datetime64[ns] 397884 17282 0 0.0 [[2010-12-01 08:26:00, 2010-12-01 08:28:00, 20... NaN NaN
CustomerID float64 397884 4338 0 0.0 [[17850.0, 13047.0, 12583.0, 13748.0, 15100.0,... 0.025729 -1.180822
Description object 397884 3877 0 0.0 [[WHITE HANGING HEART T-LIGHT HOLDER, WHITE ME... NaN NaN
StockCode object 397884 3665 0 0.0 [[85123A, 71053, 84406B, 84029G, 84029E, 22752... NaN NaN
UnitPrice float64 397884 440 0 0.0 [[2.55, 3.39, 2.75, 7.65, 4.25, 1.85, 1.69, 2.... 204.032727 58140.396673
Quantity int64 397884 301 0 0.0 [[6, 8, 2, 32, 3, 4, 24, 12, 48, 18, 20, 36, 8... 409.892972 178186.243253
Country object 397884 37 0 0.0 [[United Kingdom, France, Australia, Netherlan... NaN NaN
In [10]:
df_ori.describe()
Out[10]:
Quantity UnitPrice CustomerID
count 397884.000000 397884.000000 397884.000000
mean 12.988238 3.116488 15294.423453
std 179.331775 22.097877 1713.141560
min 1.000000 0.001000 12346.000000
25% 2.000000 1.250000 13969.000000
50% 6.000000 1.950000 15159.000000
75% 12.000000 3.750000 16795.000000
max 80995.000000 8142.750000 18287.000000
In [11]:
stock_desc = df_ori.groupby(['StockCode','Description']).count().reset_index()
stock_desc_count = stock_desc['StockCode'].value_counts().reset_index()
stock_desc_count[stock_desc_count['StockCode']>1][:2]
Out[11]:
index StockCode
0 23236 4
1 23196 4
In [12]:
df_ori[df_ori['StockCode'] == 23196]['Description'].unique()
Out[12]:
array(['RETRO LEAVES MAGNETIC NOTEPAD',
       'RETO LEAVES MAGNETIC SHOPPING LIST',
       'LEAVES MAGNETIC  SHOPPING LIST', 'VINTAGE LEAF MAGNETIC NOTEPAD'],
      dtype=object)
In [13]:
temp = df_ori['Description'].groupby(df_ori['StockCode']).unique().apply(pd.Series)
temp = temp[0].to_dict()
df_ori['Description'] = df_ori['StockCode'].map(temp)

df_ori['CustomerID'] = df_ori['CustomerID'].astype('int32')
In [14]:
df_ori[df_ori['StockCode'] == 23196]['Description'].unique()
Out[14]:
array(['RETRO LEAVES MAGNETIC NOTEPAD'], dtype=object)

We validate the result with the describe function

In [15]:
details = describe(df_ori)
display(details.sort_values(by='distincts', ascending=False))
Data shape: (397884, 8)
___________________________
Data types:

 object            4
int64             1
int32             1
float64           1
datetime64[ns]    1
Name: types, dtype: int64
___________________________
types counts distincts nulls missing ratio uniques skewness kurtosis
InvoiceNo object 397884 18532 0 0.0 [[536365, 536366, 536367, 536368, 536369, 5363... -0.178524 -1.200748
InvoiceDate datetime64[ns] 397884 17282 0 0.0 [[2010-12-01 08:26:00, 2010-12-01 08:28:00, 20... NaN NaN
CustomerID int32 397884 4338 0 0.0 [[17850, 13047, 12583, 13748, 15100, 15291, 14... 0.025729 -1.180822
StockCode object 397884 3665 0 0.0 [[85123A, 71053, 84406B, 84029G, 84029E, 22752... NaN NaN
Description object 397884 3647 0 0.0 [[WHITE HANGING HEART T-LIGHT HOLDER, WHITE ME... NaN NaN
UnitPrice float64 397884 440 0 0.0 [[2.55, 3.39, 2.75, 7.65, 4.25, 1.85, 1.69, 2.... 204.032727 58140.396673
Quantity int64 397884 301 0 0.0 [[6, 8, 2, 32, 3, 4, 24, 12, 48, 18, 20, 36, 8... 409.892972 178186.243253
Country object 397884 37 0 0.0 [[United Kingdom, France, Australia, Netherlan... NaN NaN

Analyze sales

We analyze sales by plotting them against some of the other variables.

  • Market representation and country
In [16]:
df_ori['Internal'] = 'No'
df_ori.loc[df_ori['Country'] == 'United Kingdom', 'Internal'] = 'Yes'
fig = px.pie(df_ori, names='Internal', title='Market representation', color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

df_ori['Amount'] = df_ori['UnitPrice']*df_ori['Quantity']
temp = pd.DataFrame(df_ori.groupby('Country')['Amount'].sum()).reset_index().sort_values(by=['Amount'], ascending=False)
fig = px.bar(temp, x='Country', y='Amount', title='Amount sales by country',  color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()
  • Top Customers
In [17]:
temp = df_ori.groupby('CustomerID')[['Amount']].sum().sort_values(by=['Amount'], ascending=False)
ratio_sales_inplot = str(list(temp[:50].sum())[0] / list(temp.sum())[0] * 100)[:5] + ' %'
fig = px.bar(temp[:50].reset_index(), x='CustomerID', y='Amount', title='50 Best Customers by amount ('+ ratio_sales_inplot + ' of total amount of sales)', color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(xaxis_type = 'category')
fig.show()

ratio_sales_inplot = str(list(temp[:10].sum())[0] / list(temp.sum())[0] * 100)[:5] + ' %'
fig = px.bar(temp[:10].reset_index(), x='CustomerID', y='Amount', title='10 Best Customers by amount ('+ ratio_sales_inplot + ' of total amount of sales)', color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(xaxis_type = 'category')
fig.show()

temp = df_ori.groupby('CustomerID')[['Amount']].count().sort_values(by=['Amount'], ascending=False)
temp
ratio_sales_inplot = str(list(temp[:10].sum())[0] / list(temp.sum())[0] * 100)[:4] + ' %'
fig = px.bar(temp[:10].reset_index(), x='CustomerID', y='Amount', title='10 Best Customers by frequency of sales ('+ ratio_sales_inplot + ' of total frequency of sales)', color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(xaxis_type = 'category')
fig.show()
  • Top products
In [18]:
temp = df_ori.groupby(['StockCode', 'Description'])[['Amount', 'Description']].sum().sort_values(by=['Amount'], ascending=False).reset_index()
ratio_sales_inplot = str(list(temp[['Amount']][:10].sum())[0] / list(temp[['Amount']].sum())[0] * 100)[:5] + ' %'
fig = px.bar(temp[:10].reset_index(), x='Description', y='Amount', title='10 best products by amount ('+ ratio_sales_inplot + ' of total amount of sales)', color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(xaxis_type = 'category')
fig.update_traces(marker_color='rgb(246,207,113)')
fig.show()

temp = df_ori.groupby(['StockCode', 'Description'])[['Amount']].count().sort_values(by=['Amount'], ascending=False).reset_index()
ratio_sales_inplot = str(list(temp[['Amount']][:10].sum())[0] / list(temp[['Amount']].sum())[0] * 100)[:5] + ' %'
fig = px.bar(temp[:10].reset_index(), x='Description', y='Amount', title='10 best products by frequency ('+ ratio_sales_inplot + ' of total frequency of sales)', color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(xaxis_type = 'category')
fig.update_traces(marker_color='rgb(246,207,113)')
fig.show()

Customer Segmentation with RFM

We are going to use the Recency, Frequency, Monetary (RFM) model. As described on its Wikipedia page, RFM stands for three dimensions:

  • Recency – How recently did the customer purchase?
  • Frequency – How often do they purchase?
  • Monetary Value – How much do they spend?

The time window for this analysis must be chosen carefully, because results can be biased or inaccurate if it spans an extremely long duration.

In [19]:
snapshot_date = df_ori['InvoiceDate'].max() + timedelta(days=1)
print(snapshot_date)
2011-12-10 12:50:00
In [20]:
#Extract features for each customer
data_process = df_ori.groupby(['CustomerID']).agg({
        'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
        'InvoiceNo': 'count',
        'Amount': 'sum'})

#renaming
data_process.columns = ['Recency', 'Frequency', 'MonetaryValue']

data_process[:3]
Out[20]:
Recency Frequency MonetaryValue
CustomerID
12346 326 1 77183.60
12347 2 182 4310.00
12348 75 31 1797.24

Let's look at the distribution of our data.

In [21]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
for i, feature in enumerate(list(data_process.columns)):
    sns.distplot(data_process[feature], ax=axes[i])

The data is skewed. A log transformation makes the distributions closer to normal, which helps the subsequent analysis.
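As a quick illustration of why the log transform helps (on a synthetic log-normal sample, not the notebook's data), we can compare skewness before and after:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample, similar in shape to the monetary values
rng = np.random.RandomState(0)
s = pd.Series(rng.lognormal(mean=0, sigma=1, size=10_000))

skew_before = s.skew()          # strongly right-skewed
skew_after = np.log(s).skew()   # roughly symmetric after the log
print(round(skew_before, 2), round(skew_after, 2))
```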

In [22]:
data_process['Recency_log'] = data_process['Recency'].apply(math.log)
data_process['Frequency_log'] = data_process['Frequency'].apply(math.log)
data_process['MonetaryValue_log'] = data_process['MonetaryValue'].apply(math.log)
data_process[:3]
Out[22]:
Recency Frequency MonetaryValue Recency_log Frequency_log MonetaryValue_log
CustomerID
12346 326 1 77183.60 5.786897 0.000000 11.253942
12347 2 182 4310.00 0.693147 5.204007 8.368693
12348 75 31 1797.24 4.317488 3.433987 7.494007
In [23]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
        
for i, feature in enumerate(list(data_process.columns[3:])):
    sns.distplot(data_process[feature], ax=axes[i])

Now let's normalize our data, which improves k-means performance. We use MinMaxScaler, rename the columns, and then call describe to inspect the results.
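The choice between the two scalers (MinMaxScaler is used below; StandardScaler is left commented out) can be illustrated on a hypothetical toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column with an outlier, to make the contrast between the scalers visible
x = np.array([[1.0], [2.0], [3.0], [100.0]])

mm = MinMaxScaler().fit_transform(x)    # rescales into [0, 1]
ss = StandardScaler().fit_transform(x)  # zero mean, unit variance
print(mm.ravel())
print(ss.ravel())
```

MinMaxScaler bounds every feature to the same [0, 1] range without changing the shape of its distribution, which suits the already log-transformed features.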

In [24]:
scaler = MinMaxScaler()
#scaler = StandardScaler()
data_process_normalized = pd.DataFrame(scaler.fit_transform(data_process))
#renaming
data_process_normalized.columns = ['n_'+ i for i in data_process.columns]
data_process_normalized.describe()
Out[24]:
n_Recency n_Frequency n_MonetaryValue n_Recency_log n_Frequency_log n_MonetaryValue_log
count 4338.000000 4338.000000 4338.000000 4338.000000 4338.000000 4338.000000
mean 0.245406 0.011563 0.007318 0.635951 0.410325 0.469546
std 0.268135 0.029159 0.032081 0.241793 0.147873 0.112364
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.045576 0.002039 0.001084 0.487888 0.315929 0.392678
50% 0.134048 0.005098 0.002394 0.663683 0.414097 0.462699
75% 0.378016 0.012618 0.005917 0.836532 0.513518 0.543051
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Now we are going to plot the inertia for each number of clusters from 1 to 15.

In [25]:
SSE, max_k = [], 15
list_input = list(data_process_normalized.columns[3:])
for k in range(max_k):
    kmeans = KMeans(n_clusters=k+1, random_state=42).fit(data_process_normalized[list_input])
    SSE.append(kmeans.inertia_)

fig = go.Figure(data=go.Scatter(x=list(range(1, max_k+1)), y=SSE, ))
fig.update_traces(marker_size=14)
fig.show()

Using the elbow heuristic, we decide to use k = 5 for our model.
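A complementary sanity check for the choice of k, not part of the original analysis, is the silhouette score; a minimal sketch on synthetic blobs standing in for the normalized RFM features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
# Three well-separated synthetic blobs in a 3-D feature space
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 3)) for c in (0.0, 1.0, 2.0)])

# Fit k-means for a range of k and keep the silhouette score of each labeling
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the silhouette curve is rarely this clean, but a k whose score clearly dominates its neighbors supports the elbow-based choice.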

In [26]:
model =  KMeans(n_clusters=5, random_state=42).fit(data_process_normalized[list_input])
data_process_normalized['cluster'] = model.predict(data_process_normalized[list_input])
fig = px.scatter_3d(data_process_normalized, x=list_input[0], y=list_input[1], z=list_input[2],
              color='cluster')
fig.show()
In [27]:
data_process_normalized[:3]
Out[27]:
n_Recency n_Frequency n_MonetaryValue n_Recency_log n_Frequency_log n_MonetaryValue_log cluster
0 0.871314 0.000000 0.275443 0.976814 0.000000 0.885101 3
1 0.002681 0.023069 0.015368 0.117002 0.580294 0.627984 1
2 0.198391 0.003824 0.006401 0.728782 0.382920 0.550037 0
In [28]:
data_process_normalized.groupby('cluster').agg({
    'n_Recency': ['mean', 'min', 'max'],
    'n_Frequency': ['mean', 'min', 'max'],
    'n_MonetaryValue': ['mean', 'min', 'max']
})
Out[28]:
n_Recency n_Frequency n_MonetaryValue
mean min max mean min max mean min max
cluster
0 0.286681 0.096515 0.994638 0.009218 0.000765 0.069080 0.004825 0.000469 0.158923
1 0.006067 0.000000 0.018767 0.033619 0.000000 1.000000 0.026043 0.000415 1.000000
2 0.054352 0.016086 0.176944 0.021227 0.003314 0.208514 0.011879 0.000766 0.445788
3 0.618743 0.168901 1.000000 0.001860 0.000000 0.010706 0.001529 0.000000 0.275443
4 0.093495 0.005362 0.211796 0.002602 0.000000 0.012108 0.001495 0.000011 0.024887

With this, we are done. We can characterize our customers based on the mean and range of values in each cluster. Further analysis could include scoring customers based on RFM ranges, or making the clustering more robust by adding metrics beyond RFM, such as demographics or product features.
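The RFM-range scoring mentioned above can be sketched with pd.qcut; the mini table below is hypothetical, with the same columns as data_process:

```python
import pandas as pd

# Hypothetical mini RFM table (same columns as data_process)
rfm = pd.DataFrame({
    'Recency': [5, 40, 200, 365, 15, 90],
    'Frequency': [50, 10, 2, 1, 30, 5],
    'MonetaryValue': [5000, 800, 100, 20, 2500, 400],
}, index=pd.Index([1, 2, 3, 4, 5, 6], name='CustomerID'))

# Quartile scores: low Recency is best (4); high Frequency/MonetaryValue are best (4)
rfm['R'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1]).astype(int)
rfm['F'] = pd.qcut(rfm['Frequency'], 4, labels=[1, 2, 3, 4]).astype(int)
rfm['M'] = pd.qcut(rfm['MonetaryValue'], 4, labels=[1, 2, 3, 4]).astype(int)
rfm['RFM_Score'] = rfm[['R', 'F', 'M']].sum(axis=1)
print(rfm[['R', 'F', 'M', 'RFM_Score']])
```

Customers can then be labeled by score bands (e.g. 10-12 as "best", 3-5 as "at risk"), which complements the k-means clusters above.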

References